Multi-Action Dialog Policy Learning from Logged User Feedback
Abstract
Multi-action dialog policy (MADP), which generates multiple atomic dialog actions per turn, has been widely applied in task-oriented dialog systems to provide expressive and efficient system responses. Existing MADP models usually imitate action combinations from labeled multi-action dialog samples. Due to data limitations, they generalize poorly toward unseen dialog flows. While reinforcement learning-based methods have been proposed to incorporate service ratings from real users and user simulators as external supervision signals, they suffer from sparse and less credible dialog-level rewards. To cope with this problem, we explore improving multi-action dialog policy learning (MADPL) with explicit and implicit turn-level user feedback received for historical predictions (i.e., logged user feedback), which is cost-efficient to collect and faithful to real-world scenarios. The task is challenging since logged user feedback provides only partial label feedback, limited to the particular historical dialog actions predicted by the agent. To fully exploit such feedback information, we propose BanditMatch, which addresses the task from a feedback-enhanced semi-supervised learning (SSL) perspective with a hybrid learning objective of SSL and bandit learning. BanditMatch integrates pseudo-labeling methods to better explore the action space through constructing full label feedback. Extensive experiments show that our BanditMatch improves MADPL over the state-of-the-art methods by generating more concise and informative responses. The source code and the appendix of this paper can be obtained from https://github.com/ShuoZhangXJTU/BanditMatch.
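To make the idea of a hybrid objective more concrete, the following is a minimal sketch of how a supervised loss, a partial-label loss on logged feedback, and a pseudo-labeling loss might be combined for a multi-label action policy. It is not the BanditMatch objective: the function names, the masked cross-entropy used for the logged feedback, the confidence threshold, and the loss weights are all assumptions made for illustration. The authors' actual formulation is in the paper and the repository linked above.

```python
import math

EPS = 1e-12  # numerical guard for log(0)


def bce(p, y):
    """Binary cross-entropy for one action probability p and a target y in {0, 1}."""
    p = min(max(p, EPS), 1.0 - EPS)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def supervised_loss(probs, labels):
    """Full-label case: every atomic action has a known 0/1 label."""
    return sum(bce(p, y) for p, y in zip(probs, labels)) / len(probs)


def logged_feedback_loss(probs, feedback):
    """Partial-label case (assumed form): `feedback` maps an action index to
    1 (accepted) or 0 (rejected), and only covers actions that the logging
    policy actually predicted. All other actions contribute nothing."""
    if not feedback:
        return 0.0
    return sum(bce(probs[a], r) for a, r in feedback.items()) / len(feedback)


def pseudo_label_loss(teacher_probs, student_probs, threshold=0.9):
    """Pseudo-labeling (assumed, FixMatch-style): keep only actions where the
    teacher prediction is confident, round it to a hard 0/1 target, and
    score the student prediction against that target."""
    terms = [
        bce(s, 1 if t >= 0.5 else 0)
        for t, s in zip(teacher_probs, student_probs)
        if max(t, 1.0 - t) >= threshold
    ]
    return sum(terms) / len(terms) if terms else 0.0


def hybrid_loss(sup, logged, pseudo, w_logged=1.0, w_pseudo=1.0):
    """Weighted sum of the three terms; the weights are hypothetical."""
    return sup + w_logged * logged + w_pseudo * pseudo


# Example with three atomic actions.
sup = supervised_loss([0.9, 0.2, 0.5], [1, 0, 1])
logged = logged_feedback_loss([0.9, 0.2, 0.5], {0: 1, 1: 0})
pseudo = pseudo_label_loss([0.95, 0.5, 0.02], [0.8, 0.5, 0.1])
total = hybrid_loss(sup, logged, pseudo)
```

In this sketch the logged-feedback term only touches the actions that received feedback, which mirrors the "partial label" difficulty described in the abstract, while the pseudo-label term fills in targets for confidently predicted actions.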
Similar resources
Batch learning from logged bandit feedback through counterfactual risk minimization
We develop a learning principle and an efficient algorithm for batch learning from logged bandit feedback. This learning setting is ubiquitous in online systems (e.g., ad placement, web search, recommendation), where an algorithm makes a prediction (e.g., ad ranking) for a given input (e.g., query) and observes bandit feedback (e.g., user clicks on presented ads). We first address the counterfa...
Counterfactual Risk Minimization: Learning from Logged Bandit Feedback
We develop a learning principle and an efficient algorithm for batch learning from logged bandit feedback. This learning setting is ubiquitous in online systems (e.g., ad placement, web search, recommendation), where an algorithm makes a prediction (e.g., ad ranking) for a given input (e.g., query) and observes bandit feedback (e.g., user clicks on presented ads). We first address the counterfa...
Discovering Action-Dependent Relevance: Learning from Logged Data
In many learning problems, the decision maker is provided with various (types of) context information that she might utilize to select actions in order to maximize performance/rewards. But not all information is equally relevant: some context information may be more relevant to the decision problem at hand. Discovering and exploiting the most relevant context information speeds up learning, red...
Joint Optimization of User-desired Content in Multi-document Summaries by Learning from User Feedback
In this paper, we propose an extractive multi-document summarization (MDS) system using joint optimization and active learning for content selection grounded in user feedback. Our method interactively obtains user feedback to gradually improve the results of a state-of-the-art integer linear programming (ILP) framework for MDS. Our methods complement fully automatic methods in producing highqua...
Journal
Journal title: Proceedings of the ... AAAI Conference on Artificial Intelligence
Year: 2023
ISSN: 2159-5399, 2374-3468
DOI: https://doi.org/10.1609/aaai.v37i11.26636